Vocabulary Selection for a Broadcast News Transcription System using a Morpho-syntatic Approach
نویسندگان
چکیده
Although the vocabularies of ASR systems are designed to achieve high coverage for the expected domain, out-ofvocabulary (OOV) words cannot be avoided. Particularly, for daily and real-time transcription of Broadcast News (BN) data in highly inflected languages, the rapid vocabulary growth leads to high OOV word rates. To overcome this problem, we present a new morpho-syntatic approach to dynamically select the target vocabulary for this particular domain by trading off between the OOV word rate and vocabulary size. We evaluate this approach against the common selection strategy based on word frequency. Experiments have been carried out for a European Portuguese BN transcription system. Results computed on seven news shows, yields a relative reduction of 37.8% in OOV word rate against the baseline system and 5.5% when compared with the word frequency common approach.
منابع مشابه
Vocabulary selection for a broadcast news transcription system using a morpho-syntactic approach
Although the vocabularies of ASR systems are designed to achieve high coverage for the expected domain, out-ofvocabulary (OOV) words cannot be avoided. Particularly, for daily and real-time transcription of Broadcast News (BN) data in highly inflected languages, the rapid vocabulary growth leads to high OOV word rates. To overcome this problem, we present a new morpho-syntatic approach to dynam...
متن کاملAutomatic estimation of language model parameters for unseen words using morpho-syntactic contextual information
Various information sources naturally contains new words that appear in a daily basis and which are not present in the vocabulary of the speech recognition system but are important for applications such as closed-captioning or information dissemination. To be recognized, those words need to be included in the vocabulary and the language model (LM) parameters updated. In this context, we propose...
متن کاملTitle generation for spoken broadcast news using a training corpus
The problem of title generation involves finding the essence of a document and expressing it in only a few words. The results of a query to the Informedia Digital Video Library are summarized through an automatically generated title for each retrieved news story. When the document is errorful, as with speech-recognized broadcast news stories, the title creation challenge becomes even greater. W...
متن کاملDomain Adaptation of a Broadcast News Transcription System for the Portuguese Parliament
The main goal of this work is the adaptation of a broadcast news transcription system to a new domain, namely, the Portuguese Parliament plenary meetings. This paper describes the different domain adaptation steps that lowered our baseline absolute word error rate from 20.1% to 16.1%. These steps include the vocabulary selection, in order to include specific domain terms, language model adaptat...
متن کاملDynamic language modeling for European Portuguese
This paper reports on the work done on vocabulary and language model daily adaptation for a European Portuguese broadcast news transcription system. The proposed adaptation framework takes into consideration European Portuguese language characteristics, such as its high level of inflection and complex verbal system. A multi-pass speech recognition framework using contemporary written texts avai...
متن کامل